Least Squares Support Vector Machine
Least-squares support-vector machines (LS-SVM), used in statistics and statistical modeling, are least-squares versions of support-vector machines (SVM), a set of related supervised learning methods that analyze data and recognize patterns and that are used for classification and regression analysis. In this version one finds the solution by solving a set of linear equations instead of the convex quadratic programming (QP) problem of classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle. LS-SVMs are a class of kernel-based learning methods.


From support-vector machine to least-squares support-vector machine

Given a training set \{x_i, y_i\}_{i=1}^N with input data x_i \in \mathbb{R}^n and corresponding binary class labels y_i \in \{-1, +1\}, the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions:
: \begin{cases} w^T \phi(x_i) + b \ge 1, & \text{if } y_i = +1, \\ w^T \phi(x_i) + b \le -1, & \text{if } y_i = -1, \end{cases}
which is equivalent to
: y_i \left[ w^T \phi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,
where \phi(x) is the nonlinear map from the original space to the high- or infinite-dimensional space.
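To make the margin condition concrete, the following minimal numpy sketch checks whether a candidate hyperplane satisfies y_i (w^T \phi(x_i) + b) \ge 1 on a toy data set. The identity feature map \phi(x) = x, the data and the values of w and b are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])  # inputs x_i
y = np.array([+1, +1, -1, -1])                                      # class labels y_i
w = np.array([0.5, 0.5])                                            # candidate weight vector
b = 0.0                                                             # candidate bias

margins = y * (X @ w + b)                   # y_i * (w^T phi(x_i) + b) with phi = identity
print(margins, bool(np.all(margins >= 1)))  # every margin >= 1 means the constraints hold
```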


Inseparable data

In case such a separating hyperplane does not exist, we introduce so-called slack variables \xi_i such that
: \begin{cases} y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, & i = 1, \ldots, N, \\ \xi_i \ge 0, & i = 1, \ldots, N. \end{cases}
According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:
: \min J_1(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i,
: \text{subject to } \begin{cases} y_i \left[ w^T \phi(x_i) + b \right] \ge 1 - \xi_i, & i = 1, \ldots, N, \\ \xi_i \ge 0, & i = 1, \ldots, N. \end{cases}
To solve this problem, we construct the Lagrangian function:
: L_1(w, b, \xi, \alpha, \beta) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i - \sum_{i=1}^N \alpha_i \left\{ y_i \left[ w^T \phi(x_i) + b \right] - 1 + \xi_i \right\} - \sum_{i=1}^N \beta_i \xi_i,
where \alpha_i \ge 0, \ \beta_i \ge 0 \ (i = 1, \ldots, N) are the Lagrangian multipliers. The optimal point is a saddle point of the Lagrangian function, from which we obtain
: \begin{cases} \frac{\partial L_1}{\partial w} = 0 \quad \to \quad w = \sum_{i=1}^N \alpha_i y_i \phi(x_i), \\ \frac{\partial L_1}{\partial b} = 0 \quad \to \quad \sum_{i=1}^N \alpha_i y_i = 0, \\ \frac{\partial L_1}{\partial \xi_i} = 0 \quad \to \quad 0 \le \alpha_i \le c, \; i = 1, \ldots, N. \end{cases}
By substituting w by its expression in the Lagrangian formed from the appropriate objective and constraints, we obtain the following quadratic programming problem:
: \max Q_1(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^N \alpha_i,
where K(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle is called the kernel function. Solving this QP problem subject to the constraints \sum_{i=1}^N \alpha_i y_i = 0 and 0 \le \alpha_i \le c, we obtain the hyperplane in the high-dimensional space and hence the classifier in the original space.
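As an illustration of the dual problem, the sketch below maximizes Q_1(\alpha) on a toy linearly separable set with a generic constrained optimizer (SciPy's SLSQP) rather than a dedicated QP solver; the data, the linear kernel choice and the value of c are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
c = 1.0                                         # box constraint 0 <= alpha_i <= c

K = X @ X.T                                     # linear kernel K(x_i, x_j) = x_i^T x_j
Q = (y[:, None] * y[None, :]) * K               # Q_ij = y_i y_j K(x_i, x_j)

def neg_Q1(alpha):
    # Minimizing -Q_1(alpha) is the same as maximizing Q_1(alpha).
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

constraints = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bounds = [(0.0, c)] * len(y)
res = minimize(neg_Q1, np.zeros(len(y)), bounds=bounds, constraints=constraints)

alpha = res.x
w = (alpha * y) @ X                             # w = sum_i alpha_i y_i x_i (linear kernel)
print(alpha, w)
```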


Least-squares SVM formulation

The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as
: \min J_2(w, b, e) = \frac{\mu}{2} w^T w + \frac{\zeta}{2} \sum_{i=1}^N e_i^2,
subject to the equality constraints
: y_i \left[ w^T \phi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N.
The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets y_i = \pm 1. Using y_i^2 = 1, we have
: \sum_{i=1}^N e_i^2 = \sum_{i=1}^N (y_i e_i)^2 = \sum_{i=1}^N \left( y_i - (w^T \phi(x_i) + b) \right)^2,
with e_i = y_i - (w^T \phi(x_i) + b). Notice that this error would also make sense for least-squares data fitting, so that the same end result holds for the regression case. Hence the LS-SVM classifier formulation is equivalent to
: J_2(w, b, e) = \mu E_W + \zeta E_D
with E_W = \frac{1}{2} w^T w and E_D = \frac{1}{2} \sum_{i=1}^N e_i^2 = \frac{1}{2} \sum_{i=1}^N \left( y_i - (w^T \phi(x_i) + b) \right)^2.

Both \mu and \zeta should be considered as hyperparameters to tune the amount of regularization versus the sum squared error. The solution depends only on the ratio \gamma = \zeta / \mu, therefore the original formulation uses only \gamma as tuning parameter. We use both \mu and \zeta as parameters in order to provide a Bayesian interpretation to LS-SVM.

The solution of the LS-SVM regressor is obtained after we construct the Lagrangian function (scaling the objective by 1/\mu, which does not change the minimizer):
: \begin{align} L_2(w, b, e, \alpha) &= J_2(w, e) - \sum_{i=1}^N \alpha_i \left\{ \left[ w^T \phi(x_i) + b \right] + e_i - y_i \right\} \\ &= \frac{1}{2} w^T w + \frac{\gamma}{2} \sum_{i=1}^N e_i^2 - \sum_{i=1}^N \alpha_i \left\{ \left[ w^T \phi(x_i) + b \right] + e_i - y_i \right\}, \end{align}
where \alpha_i \in \mathbb{R} are the Lagrange multipliers. The conditions for optimality are
: \begin{cases} \frac{\partial L_2}{\partial w} = 0 \quad \to \quad w = \sum_{i=1}^N \alpha_i \phi(x_i), \\ \frac{\partial L_2}{\partial b} = 0 \quad \to \quad \sum_{i=1}^N \alpha_i = 0, \\ \frac{\partial L_2}{\partial e_i} = 0 \quad \to \quad \alpha_i = \gamma e_i, \; i = 1, \ldots, N, \\ \frac{\partial L_2}{\partial \alpha_i} = 0 \quad \to \quad y_i = w^T \phi(x_i) + b + e_i, \; i = 1, \ldots, N. \end{cases}
Elimination of w and e yields a linear system instead of a quadratic programming problem:
: \begin{bmatrix} 0 & 1_N^T \\ 1_N & \Omega + \gamma^{-1} I_N \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},
with Y = [y_1, \ldots, y_N]^T, 1_N = [1, \ldots, 1]^T and \alpha = [\alpha_1, \ldots, \alpha_N]^T. Here, I_N is an N \times N identity matrix, and \Omega \in \mathbb{R}^{N \times N} is the kernel matrix defined by \Omega_{ij} = \phi(x_i)^T \phi(x_j) = K(x_i, x_j).
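The linear system above is what makes LS-SVM training cheap in practice: the whole fit reduces to one call to a dense linear solver. The following sketch, assuming an RBF kernel and illustrative values for \gamma and \sigma, builds the block matrix, solves for b and \alpha with numpy, and classifies new points with \operatorname{sign}(\sum_i \alpha_i K(x, x_i) + b), which follows from w = \sum_i \alpha_i \phi(x_i).

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / sigma^2), computed for all pairs of rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    # Solve [[0, 1_N^T], [1_N, Omega + I_N/gamma]] [b; alpha] = [0; Y].
    N = len(y)
    Omega = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]                     # b, alpha

def lssvm_predict(X_train, b, alpha, X_new, sigma=1.0):
    # Decision function sign(sum_i alpha_i K(x, x_i) + b), since w = sum_i alpha_i phi(x_i).
    return np.sign(rbf_kernel(X_new, X_train, sigma) @ alpha + b)

X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
b, alpha = lssvm_train(X, y)
print(lssvm_predict(X, b, alpha, X))           # recovers the training labels on this toy set
```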


Kernel function ''K''

For the kernel function ''K''(•, •) one typically has the following choices:
* Linear kernel: K(x, x_i) = x_i^T x,
* Polynomial kernel of degree d: K(x, x_i) = \left( \frac{x_i^T x}{c} + 1 \right)^d,
* Radial basis function (RBF) kernel: K(x, x_i) = \exp\left( -\frac{\| x - x_i \|^2}{\sigma^2} \right),
* MLP kernel: K(x, x_i) = \tanh\left( k \, x_i^T x + \theta \right),
where d, c, \sigma, k and \theta are constants. Notice that the Mercer condition holds for all c, \sigma \in \mathbb{R}^+ and d \in \mathbb{N} in the polynomial and RBF case, but not for all possible choices of k and \theta in the MLP case. The scale parameters c, \sigma and k determine the scaling of the inputs in the polynomial, RBF and MLP kernel functions. This scaling is related to the bandwidth of the kernel in statistics, where it is shown that the bandwidth is an important parameter of the generalization behavior of a kernel method.
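For reference, the kernel choices above can be written out directly in code. The parameter names (c, \sigma, k, \theta) follow the text, while the default values and the exact placement of the scale parameters are assumptions of this sketch.

```python
import numpy as np

def linear_kernel(x, xi):
    # K(x, x_i) = x_i^T x
    return xi @ x

def polynomial_kernel(x, xi, c=1.0, d=3):
    # K(x, x_i) = (x_i^T x / c + 1)^d  (placement of the scale c is an assumption)
    return (xi @ x / c + 1.0) ** d

def rbf_kernel(x, xi, sigma=1.0):
    # K(x, x_i) = exp(-||x - x_i||^2 / sigma^2)
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def mlp_kernel(x, xi, k=1.0, theta=0.0):
    # K(x, x_i) = tanh(k x_i^T x + theta); Mercer's condition holds only for some (k, theta)
    return np.tanh(k * (xi @ x) + theta)

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xi), polynomial_kernel(x, xi), rbf_kernel(x, xi), mlp_kernel(x, xi))
```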


Bayesian interpretation for LS-SVM

A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as P[f] \propto \exp\left( -\beta \left\| \hat{P} f \right\|^2 \right). Here \beta > 0 is a constant and \hat{P} is the regularization operator corresponding to the selected kernel.

A general Bayesian evidence framework was developed by MacKay (MacKay, D. J. C. The evidence framework applied to classification networks. ''Neural Computation'', 4(5): 720–736, Sep. 1992), who applied it to the problems of regression, feedforward neural networks and classification networks. Provided a data set D, a model \mathbb{M} with parameter vector w and a so-called hyperparameter or regularization parameter \lambda, Bayesian inference is constructed with 3 levels of inference:
* In level 1, for a given value of \lambda, the first level of inference infers the posterior distribution of w by Bayes' rule
::p(w | D, \lambda, \mathbb{M}) \propto p(D | w, \mathbb{M}) \, p(w | \lambda, \mathbb{M}).
* The second level of inference determines the value of \lambda by maximizing
::p(\lambda | D, \mathbb{M}) \propto p(D | \lambda, \mathbb{M}) \, p(\lambda | \mathbb{M}).
* The third level of inference in the evidence framework ranks different models by examining their posterior probabilities
::p(\mathbb{M} | D) \propto p(D | \mathbb{M}) \, p(\mathbb{M}).
We can see that the Bayesian evidence framework is a unified theory for learning the model and for model selection. Kwok used the Bayesian evidence framework to interpret the formulation of SVM and model selection, and he also applied the Bayesian evidence framework to support vector regression.

Now, given the data points \{x_i, y_i\}_{i=1}^N and the hyperparameters \mu and \zeta of the model \mathbb{M}, the model parameters w and b are estimated by maximizing the posterior p(w, b | D, \log \mu, \log \zeta, \mathbb{M}). Applying Bayes' rule, we obtain
:p(w, b | D, \log \mu, \log \zeta, \mathbb{M}) = \frac{p(D | w, b, \log \mu, \log \zeta, \mathbb{M}) \, p(w, b | \log \mu, \log \zeta, \mathbb{M})}{p(D | \log \mu, \log \zeta, \mathbb{M})},
where p(D | \log \mu, \log \zeta, \mathbb{M}) is a normalizing constant such that the integral over all possible w and b is equal to 1. We assume w and b are independent of the hyperparameter \zeta, and are conditionally independent, i.e., we assume
:p(w, b | \log \mu, \log \zeta, \mathbb{M}) = p(w | \log \mu, \mathbb{M}) \, p(b | \log \sigma_b, \mathbb{M}).
When \sigma_b \to \infty, the distribution of b approximates a uniform distribution. Furthermore, we assume w and b are Gaussian distributed, so we obtain the a priori distribution of w and b with \sigma_b \to \infty to be
: \begin{align} p(w, b | \log \mu, \log \sigma_b) &= \left( \frac{\mu}{2\pi} \right)^{n_f/2} \exp\left( -\frac{\mu}{2} w^T w \right) \frac{1}{\sqrt{2\pi \sigma_b^2}} \exp\left( -\frac{b^2}{2\sigma_b^2} \right) \\ &\propto \left( \frac{\mu}{2\pi} \right)^{n_f/2} \exp\left( -\frac{\mu}{2} w^T w \right). \end{align}
Here n_f is the dimensionality of the feature space, the same as the dimensionality of w.

The probability p(D | w, b, \log \mu, \log \zeta, \mathbb{M}) is assumed to depend only on w, b, \zeta and \mathbb{M}. We assume that the data points are independently identically distributed (i.i.d.), so that
: p(D | w, b, \log \zeta, \mathbb{M}) = \prod_{i=1}^N p(x_i, y_i | w, b, \log \zeta, \mathbb{M}).
In order to obtain the least-squares cost function, it is assumed that the probability of a data point is proportional to
: p(x_i, y_i | w, b, \log \zeta, \mathbb{M}) \propto p(e_i | w, b, \log \zeta, \mathbb{M}).
A Gaussian distribution is taken for the errors e_i = y_i - (w^T \phi(x_i) + b), as
: p(e_i | w, b, \log \zeta, \mathbb{M}) = \sqrt{\frac{\zeta}{2\pi}} \exp\left( -\frac{\zeta e_i^2}{2} \right).
It is assumed that w and b are determined in such a way that the class centers \hat{m}_- and \hat{m}_+ are mapped onto the targets -1 and +1, respectively. The projections w^T \phi(x) + b of the class elements \phi(x) follow a multivariate Gaussian distribution with variance 1/\zeta.

Combining the preceding expressions, and neglecting all constants, Bayes' rule becomes
: p(w, b | D, \log \mu, \log \zeta, \mathbb{M}) \propto \exp\left( -\frac{\mu}{2} w^T w - \frac{\zeta}{2} \sum_{i=1}^N e_i^2 \right) = \exp(-J_2(w, b)).
The maximum posterior density estimates w_{MP} and b_{MP} are then obtained by minimizing the negative logarithm of this expression, i.e. by minimizing the LS-SVM cost function J_2.
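The final step can be checked numerically: with the Gaussian prior on w and the Gaussian error model above, the negative log-posterior differs from J_2(w, b) = \mu E_W + \zeta E_D only by terms independent of w and b. The linear feature map \phi(x) = x and all numbers below are illustrative assumptions.

```python
import numpy as np

mu, zeta = 0.5, 2.0
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
w, b = np.array([0.3, 0.2]), 0.1
n_f = len(w)                                           # dimensionality of the feature space

e = y - (X @ w + b)                                    # e_i = y_i - (w^T phi(x_i) + b)

log_prior = 0.5 * n_f * np.log(mu / (2 * np.pi)) - 0.5 * mu * (w @ w)             # Gaussian prior on w
log_lik = 0.5 * len(y) * np.log(zeta / (2 * np.pi)) - 0.5 * zeta * (e**2).sum()   # Gaussian errors
J2 = mu * 0.5 * (w @ w) + zeta * 0.5 * (e**2).sum()    # mu*E_W + zeta*E_D

# The w- and b-independent normalization terms:
const = 0.5 * n_f * np.log(mu / (2 * np.pi)) + 0.5 * len(y) * np.log(zeta / (2 * np.pi))
print(np.isclose(-(log_prior + log_lik) + const, J2))  # True: -log posterior = J_2 + const
```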


References


Bibliography

* Suykens, J. A. K., Van Gestel, T., De Brabanter, J., De Moor, B., Vandewalle, J., ''Least Squares Support Vector Machines'', World Scientific Pub. Co., Singapore, 2002.
* Suykens, J. A. K., Vandewalle, J., Least squares support vector machine classifiers, ''Neural Processing Letters'', vol. 9, no. 3, Jun. 1999, pp. 293–300.
* Vapnik, V., ''The Nature of Statistical Learning Theory'', Springer-Verlag, 1995. ISBN 0-387-98780-0.
* MacKay, D. J. C., Probable networks and plausible predictions—A review of practical Bayesian methods for supervised neural networks, ''Network: Computation in Neural Systems'', vol. 6, 1995, pp. 469–505.


External links


* www.esat.kuleuven.be/sista/lssvmlab/ – "Least squares support vector machine Lab (LS-SVMlab) toolbox contains Matlab/C implementations for a number of LS-SVM algorithms."
* www.kernel-machines.org – "Support Vector Machines and Kernel based methods (Smola & Schölkopf)."
* www.gaussianprocess.org – "Gaussian Processes: Data modeling using Gaussian Process priors over functions for regression and classification (MacKay, Williams)."
* www.support-vector.net – "Support Vector Machines and kernel based methods (Cristianini)."
